perm filename VIS[0,BGB]4 blob sn#073900 filedate 1973-11-25 generic text, type T, neo UTF8
2.0	Computer Vision Theory.

	2.1	Introduction.
	2.2	Vision Tasks.
	2.3	Mobile Robot Vision.
	2.4	Vision Systems.
	2.5	The Vision Cycle.
	2.6	The Nature of Images.
	2.7	The Nature of Worlds.
	2.8	Locus Solving.
	2.9	Related Work.
	2.10	Computer Vision and Artificial Intelligence.
	2.11	Visual Consciousness.
	2.12	Summary of Arguments.
	2.13	Future Vision Work.

2.1	Introduction to Computer Vision Theory.

	Vision  is the  act  or power  of  seeing.   Computer  vision
concerns programming a  computer to do a task that demands the use of
an image forming  light sensor,   such as a  television camera.   The
theory I  intend to elaborate is  that normal vision is  a continuous
process  of  keeping  an  internal  visual  simulator  in  sync  with
perceived images of the external reality, for the sake of some goal.

	In this  chapter,   several levels of  theory are  presented.
There is general  theory, which is my interpretation  of the state of
the art of computer vision.  There is the special theory, which
led to the particular design choices I have made.  There are
alternate theories  and designs, which are mentioned  for the sake of
contrast.  Finally, there is my personal world view on the  nature of
visual perception  and consciousness.   The word  "theory",   as used
here,   means simply a set of statements presenting a systematic view
of a subject. Specifically,  I wish to exclude  the connotations that
the theory  is a  mathematical theory or  a natural theory.   Perhaps
there can be  such a thing  as an "artificial  theory" which  extends
from the philosophy thru the design of an artifact.

	Although such an artificial theory is ultimately validated
by the successful production of the intended artifact, unvalidated
designs are compared by the usual tools of academic debate:
analogies, anecdotes, scenarios and rhetoric. In early 1942, there
were five ideas on how to manufacture fissionable material for a
bomb; three uranium isotope separation techniques: electromagnetic,
centrifuge   and  gaseous-diffusion;   and   two  plutonium   reactor
techniques: graphite and  heavy water. In  spite of the  considerable
power of theory  in nuclear physics,   there was  no a priori  way to
select the best method;  so all of the ideas were tried, and three of
the methods were made to work by 1945. In computer vision, there
are three substantially different approaches: description,
verification and recognition; all of which may ultimately work.
2.2	Computer Vision Tasks.

	The  overall vision research  problem  I  wish  to
discuss is that of  finding out how to write programs that can see in
the  real  world.    Alternate  vision  research   problems  include:
modeling human perception,   solving visual puzzles,   and developing
advanced  automation techniques.   In order to  approach the problem,
specific programming tasks are proposed and  solutions sought. Please
distinguish the notion of a research problem from that of a
programming task. As will  be illustrated, many  vision tasks can  be
effectively done without vision.  The vision  solution I seek must be
able  to deal  with real  images,   emphasize  the continuity  of the
visual process in time and space, and be general purpose  rather than
ad hoc. These three requirements will  be discussed again later,  and
so for  Mnemosyne, a slogan: Reality, Continuity, Generality. Now for
a quick survey of seven computer vision tasks [see table].

	First, there is the robot chauffeur task. In 1969, John
McCarthy asked me to consider the vision requirements of a computer
controlled car such as he depicted in an essay [see appendix 1]. The
idea is that a user of such an automatic car would request a
destination; the robot would select a route from an internally
stored road map; and then would proceed to its destination using
only visual data. The chauffeur is a subordinate part of the
McCarthy advice taker scenario about getting to the airport. The
problem involves representing the road map in the computer and
establishing the correspondence between the map and the visual sight of
the road as the driver servos the vehicle along the desired route.
Lacking a computer controlled car, I immediately simplified the
problem to that of tracing a route along the driveways and
parking lots that surround the Stanford A.I. Laboratory.

	Second, there is the robot explorer. In 1967, McCarthy and
Lederberg published a description of a robot for exploring the
surface of the planet Mars. The robot explorer depicted was
designed to run for long periods of time without human intervention
because the signal transmission time to Mars is as great as forty
minutes and because the 24.6 hour Martian day would place the
vehicle out of sight for 12 hours at a time. The latter difficulty
could be overcome by having a set of communication relay
satellites in orbit around Mars. The task of the explorer would be
to drive around mapping the surface of Mars, looking for interesting
features, and doing various experiments.

	The third vision task is that of the robot soldier, tank or
sentry. The problem has several forms which are quite similar to the
chauffeur, the explorer and the machine assembler. Although this
vision task has not been explicitly attempted, the reader should
note that a thorough solution to any of the other tasks almost
assures the technology to solve task 3. See section 2.X for further
discussion of social implications.

	Fourth, the turn table task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turn table. The turn table task was intentionally selected as a
simplification of the explorer task.

	Fifth, the classic blocks vision task, first attempted by
Roberts, consists of two parts: first convert a video image
into a line drawing; second, make a selection from a set of predefined
prototype models of blocks that accounts for the line drawing.
[single image vs. multiple images].
[perfect line drawing puzzles: Guzman & Waltz].
[imperfect line drawing analysis]

	Sixth, recognition vision tasks include:
character recognition, face recognition, aircraft recognition,

	Seventh, the Stanford Hand Eye Project has
recently dedicated itself to solving the task of automatic
machine assembly. In particular, the group will try to develop
techniques that will be demonstrated by the fully automatic
assembly of a chain saw gasoline engine.
Where is the part ? and where is the hole ?
Location Task:	Where is it.
Identification Task: What is it.

	Eighth, there are animal vision tasks.
	TABLE OF 3-D COMPUTER VISION TASKS.
---------------------------------------------------------------------
1. The Robot Chauffeur. Cart Task.
	Given a computer  controlled cart and a road map,
	drive the cart along a preselected route,
	without crashing into anything.

2. The Robot Explorer. Cart Task.
	Given a computer  controlled cart,
	explore and map the world,
	without crashing into anything.

3. The Robot Soldier. Cart Task.
	Given a computer controlled vehicle,
	locate and destroy the enemy.


4. Turn Table Task.
	The turn table task is to construct a 3-D model from a
	sequence of 2-D television images taken of an object
	rotated on a turn table.

5. The Blocks Task.
	First, convert a video image into a line drawing;
	Second, identify and locate the blocks in the line drawing.


6. Recognition Tasks.
	Character recognition,
	Face recognition,
	Aircraft recognition,

7. Machine Assembly Tasks.
	Where is the part ? Where is the hole ?
	Location Task:	Where is it.
	Identification Task: What is it.
---------------------------------------------------------------------
2.3	Mobile Robot Vision.
---------------------------------------------------------------------
Chauffeur Cart Solutions:
1. Map, predict 2D image, verify features, and solve for camera locus.
2. Map, retrieve 2D image, verify features, and solve for camera locus.
---------------------------------------------------------------------
Explorer Cart Solutions:
1. Photoreconnaissance, correlation and make contour maps.
2. Photoreconnaissance, match and describe, body locus solving.
---------------------------------------------------------------------

	I will now propose two solutions each for the two cart
tasks, chauffeur and explorer; that is, four systems in all.

	With abundant naivete and energy, I coded a simple
edge finder, which actually could find the left and right curbs of
the road in some television images; however (sad to relate) the edge
finder was easily fooled, and so to make it smarter I began to
put in a model of the given road. What followed was nearly
four years of trying to do really good world modeling for the sake
of computer vision by verification.

	The cart at the Stanford Artificial Intelligence Laboratory
is intended for outdoor use and consists of four bicycle wheels, a
piece of plywood, two car batteries, a television camera, a
television transmitter, and a toy airplane radio receiver. The
vehicle being discussed is not "Shakey", which belongs to the
Stanford Research Institute's Artificial Intelligence Group. There
are two A.I. labs near Stanford and each has a computer controlled
vehicle. Logically the cart has three motors, each of which can be
commanded to run in one or the other direction under computer control.
The six possible cart actions are: run forwards, run backwards, steer
to the left, steer to the right, pan camera to the left, and pan
camera to the right.
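
	The three motors and six actions can be tabulated in a few lines.
The following Python sketch is illustrative only: the names Motor,
Direction and cart_actions are hypothetical, and the assignment of
"left" and "right" to a motor direction is an assumption rather than a
description of the cart's actual control program.

```python
from enum import Enum
from itertools import product


class Motor(Enum):
    DRIVE = "drive"
    STEER = "steer"
    PAN = "pan"


class Direction(Enum):
    FORWARD = +1
    REVERSE = -1


# Names for each (motor, direction) pair, using the wording of the text.
# Which sign means "left" is an assumption made for this sketch.
ACTION_NAMES = {
    (Motor.DRIVE, Direction.FORWARD): "run forwards",
    (Motor.DRIVE, Direction.REVERSE): "run backwards",
    (Motor.STEER, Direction.FORWARD): "steer to the right",
    (Motor.STEER, Direction.REVERSE): "steer to the left",
    (Motor.PAN, Direction.FORWARD): "pan camera to the right",
    (Motor.PAN, Direction.REVERSE): "pan camera to the left",
}


def cart_actions():
    """Enumerate every (motor, direction) command the cart accepts."""
    return [ACTION_NAMES[pair] for pair in product(Motor, Direction)]
```

Three motors times two directions gives the six actions listed above;
anything more elaborate (speed, duration) would have to be layered on
top of these primitives.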

2.4	Vision Systems.

	A  computer vision  system can  be  described as  mediating
between external  perceived images and an internal  world model.  The
two poles (or  operands) of the  system are  called the "bottom"  for
images and the  "top" for the models.  The  "world model" operand can
be identified even in vision systems that do not advertise it. Work
that truly lacks a world model is not computer vision; usually it
is image processing. Given the two classes of operands, images and
worlds, there are three operations which a general vision system may
perform: recognition, verification and description.

	Verification vision is also called top-down or model-driven
vision. The verification approach involves predicting an image,
followed by comparing the predicted image and a perceived image for
slight differences which are expected but not yet measured.
Recognition vision and descriptive vision are also called bottom-up
or data-driven vision. Recognition vision is qualitative: what is in
the picture is determined by extracting a set of features
(qualities) and by classifying them according to an essentially
statistical world model. Description vision is quantitative. Many
theories are only superficially different, in that they consist of
compounding the three basic modes of vision, or of using different
forms of the two basic elements: image and model.
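
	The predict-then-compare step of verification vision can be
sketched in Python. This is a minimal illustration under stated
assumptions, not the system described in this thesis: the camera is an
ideal pinhole looking down the +Z axis with fixed orientation, and the
names project, verify and tolerance are hypothetical.

```python
def project(point, camera, focal=1.0):
    """Pinhole projection of a 3-D world point for a camera looking down +Z.

    `camera` is the (x, y, z) position of the lens center; orientation is
    assumed fixed and axis-aligned to keep the sketch short.
    """
    x, y, z = (p - c for p, c in zip(point, camera))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z, focal * y / z)


def verify(model_points, perceived, camera, tolerance=0.05):
    """Predict where each model point should appear, then compare.

    Returns one entry per feature: the (du, dv) difference between the
    perceived and predicted image position when it lies within
    `tolerance`, or None when the feature could not be verified.
    """
    residuals = []
    for point, seen in zip(model_points, perceived):
        pu, pv = project(point, camera)
        du, dv = seen[0] - pu, seen[1] - pv
        ok = abs(du) <= tolerance and abs(dv) <= tolerance
        residuals.append((du, dv) if ok else None)
    return residuals
```

The small residuals are the "slight differences which are expected but
not yet measured"; a feature that returns None signals that the model
and the image have drifted too far apart for pure verification.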

The Vision Mandala.
	1. PREDICT	3D → 2D		synthesis	verification
	2. PERCEIVE	2D → 3D		analysis	revelation
	3. COMPARE			recognition

Three modes of operation on the vision cycle.

1. Revelation Vision - Data Driven Vision.
	(nearly pure bottom up vision).

2. Verification Vision - Model Driven Vision.
	(nearly pure top down vision).

3. Recognition Vision - feature classification.
	(bottom up random access into existing top).

   Vision.
	Heuristic Vision - guess and test.
	Accommodating Vision.
	(first bottom-up, next top-down, then verify and correct).
---------------------------------------------------------------------
The vision system is:
	1. Continuous rather than discrete.
	2. Exact rather than fuzzy; numeric rather than symbolic.
	3. Bidirectional rather than one way.
2.5	The Vision Cycle.

	The vision mediation has three possible modes:
revelation, verification and recognition.

Depending on circumstances, a vision system should be able to
run almost entirely top-down (verification vision) or bottom-up
(revelation vision). Verification vision is all that is required in
a well known and consequently predictable environment; whereas
revelation vision is required in a brand new or rapidly changing
environment.

	Recognition   involves   comparing
perceived  data with predicted  data; such  recognition comparing can
be done on any  of the four  types of 2-D images  or the 3-D  models.
Arcane  recognition  techniques  can  be  avoided  by  improving  the
prediction and the analysis so that matches are nearly obvious.
2.6	The Nature of Images.

	There are three  basic kinds of  information in a  2-D visual
image:  photometric,   geometric,   and  topological; also  there are
three kinds of  2-D images: raster,  contour,   and mosaic.
The traditional  subject of image  processing involves the  study and
development  of  programs that  enhance,   transform  and  compare 2D
images.  Nearly all such  image processing work can be  subsumed into
computer vision.

---------------------------------------------------------------------
Assumption:	The perceived images are low quality, black and white,
		digitized television images.

Alternatives:	1. High quality electronic imaging device.
		2. Film scanning system.
		3. Active 3-D imaging device.
		4. Non-light devices: sound, radar, neutrinos, etc.

Discussion:

	The argument in favor of using low  quality, black and white,
television  images is  based on  poverty  rather than  principle. Low
quality television  is the  cheapest electronic  way  to perceive  an
image in real time.

	Although a super intellectual entity would have eyes that
could see the whole electromagnetic spectrum from gamma radiation to
direct current as well as "voices" that could broadcast on any and
all frequencies; the video restriction
---------------------------------------------------------------------
	An image contains three basic kinds of data:
topological data, geometric data, and photometric data.

	The quality of the particular computer vision system
that one is condemned to use is a great influence on one's
theoretical approach.

	size of image
	photometric accuracy, bits per pixel
	resolution
	speed of image taking
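
	The four quality factors listed above are tied together by
simple arithmetic. The Python sketch below shows the calculation; the
function name image_budget and the example figures in the usage note
are hypothetical and do not describe the actual Stanford hardware.

```python
def image_budget(width, height, bits_per_pixel, frames_per_second):
    """Return (bytes per image, bytes per second) for an uncompressed raster."""
    bits = width * height * bits_per_pixel
    bytes_per_image = (bits + 7) // 8   # round up to whole bytes
    return bytes_per_image, bytes_per_image * frames_per_second
```

For instance, an assumed 256 by 256 raster at 4 bits per pixel taken
once per second costs image_budget(256, 256, 4, 1), that is 32768
bytes per image and per second. Doubling the resolution in each
direction quadruples both numbers.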
2.7	The Nature of Worlds.

	The rules about the  world that can be assumed a  priori by a
programmer   are  the  laws  of   physics;  programming  a  Newtonian
simulator of the mundane physical  world to a given approximation  is
difficult but more fruitful than programming an Aristotelian
simulator.

(Reality Simulation).
---------------------------------------------------------------------
Assumption: The visual world model should be a 3-D geometric model.

Alternatives:	1. Image memory and 2-D models.
		2. Procedural Knowledge, e.g. Hewitt & Winograd.
		3. Semantic knowledge, e.g. Wilks.
		4. Formal Logic models, e.g. McCarthy & Hayes.
		5. Statistical world model, e.g. Duda & Hart.
Discussion:
---------------------------------------------------------------------
Assumption: Partial knowledge is represented by approximation.

Alternatives:	1. Tree of possibilities.
		2. Multi valued logic.
		3. Probabilities.

Discussion:
---------------------------------------------------------------------
2.8	Locus Solving.

	1. Camera Locus Solving.
	2. Body Locus Solving.
		Silhouette Cone Intersection.
		Envelope bodies.
	3. Sun Locus Solving.
		(compute it, look at it, shine and shadows).

	The crux  of computer vision  is to deduce  information about
the  world  being  viewed  from images  of  that  world.   The  world
information  most  directly  relevant  to  vision  is   the  physical
location,   extent and  light scattering  properties of  solid opaque
objects; the location, orientation and scales of the cameras that
take the pictures; and the location and nature of the lights that
illuminate  the  world. Accordingly, three  important vision problems
are  camera solving, body solving,  and sun  solving.
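
	As a minimal sketch of camera solving, suppose the camera
orientation and focal length are already known and only the lens
position is sought. Each landmark (X, Y, Z) seen at image point (u, v)
by a pinhole camera at (cx, cy, cz) then gives two linear equations,
f*cx - u*cz = f*X - u*Z and f*cy - v*cz = f*Y - v*Z, which can be
solved by least squares. The restriction to known orientation, and the
names solve3 and solve_camera_position, are assumptions of this
illustration; a full camera solver must also recover orientation and
scale.

```python
def solve3(m, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("landmarks do not determine the camera position")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        s = sum(a[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (a[r][n] - s) / a[r][r]
    return x


def solve_camera_position(landmarks, observations, focal=1.0):
    """Least-squares camera position from known landmarks and their images.

    Assumes a pinhole camera looking down +Z with known orientation.
    """
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(landmarks, observations):
        rows.append([focal, 0.0, -u])
        rhs.append(focal * X - u * Z)
        rows.append([0.0, focal, -v])
        rhs.append(focal * Y - v * Z)
    # Normal equations: (A^T A) c = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)
```

Two landmarks already give four equations for the three unknowns;
additional landmarks average out measurement error in the image
coordinates.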

The macroscopic  world doesn't change  very rapidly; between  any two
world states  there is an intermediate world  state.  Parallax is the
principal means of depth perception.  Parallax is the  alchemist that
converts 2-D images  into 3-D models. Revelation vision  is a process
of comparing perceived images taken in sequence and constructing a
3-D model of the unanticipated objects.
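
	The simplest instance of parallax as a depth cue is the two-view
triangulation formula, depth = focal length * baseline / disparity. The
Python sketch below assumes two pinhole views with the same
orientation, displaced sideways by a known baseline; the function name
depth_from_parallax is hypothetical.

```python
def depth_from_parallax(focal_length, baseline, disparity):
    """Distance to a point seen from two camera positions.

    The two views are assumed to share orientation and to be separated by
    `baseline` perpendicular to the viewing direction; `disparity` is the
    shift of the point between the two images, in the same units as
    `focal_length`.
    """
    if disparity <= 0:
        raise ValueError("no measurable parallax: point is effectively at infinity")
    return focal_length * baseline / disparity
```

Because depth varies inversely with disparity, distant objects shift
very little between views, so a longer baseline (more camera motion
between images) is needed to locate them with the same accuracy.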
2.9	Related Work.

	Larry Roberts is  justly credited for doing the  seminal work
in Computer  Vision; and although his thesis  appeared over ten years
ago the subject has languished  dependent on and overshadowed by  the
four areas called: Image Processing,   Pattern Recognition,  Computer
Graphics, and Artificial Intelligence. Outside of the computer
sciences, two subjects, psychology and neurology, also seek a
theory of vision. I will first briefly state the relevant aspects of
computer vision in each of these six subject areas, and second
acknowledge the particular authors that influenced my work.
---------------------------------------------------------------------
(Computer Vision and Image Processing).

	Image  processing  involves  the  study  and  development  of
programs that enhance,   transform and compare 2D images.  Nearly all
image processing work can eventually be applied to computer vision.

---------------------------------------------------------------------
(Computer Vision and Pattern Recognition).

	Image pattern recognition involves two steps: feature extraction
and classification.
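
	The two steps can be illustrated with one of the simplest
statistical classifiers, the nearest-centroid rule. This Python sketch
assumes that feature extraction has already reduced each image to a
short numeric vector; the names train_centroids and classify, and the
choice of the nearest-centroid rule itself, are assumptions made for
illustration.

```python
import math


def train_centroids(samples):
    """Average the feature vectors of each class.

    `samples` maps a class label to a list of equal-length feature vectors.
    """
    centroids = {}
    for label, vectors in samples.items():
        n = len(vectors)
        centroids[label] = [sum(col) / n for col in zip(*vectors)]
    return centroids


def classify(features, centroids):
    """Return the label whose centroid is nearest to `features`."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))
```

The centroids play the part of the statistical world model: they say
nothing about where an object is or what shape it has, only which
class its features most resemble.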

---------------------------------------------------------------------
(Computer Vision and Computer Graphics).

	Descriptive computer vision is the inverse of computer
graphics. The problem of computer graphics is to synthesize images
from three dimensional models; the problem of descriptive computer
vision is to analyze images into three dimensional models.

---------------------------------------------------------------------
(Computer Vision and Artificial Intelligence).

	At one extreme, computer vision may be described as merely
the problem of getting visual input hardware properly connected to a
computer; once the computer can "see" a raster of intensities in its
memory, the rest of the problem is artificial intelligence. The
other extreme is harder to depict because it requires figuring out where
to draw the line between vision software and intelligence software;
one goal I wish to pursue in this chapter is to demarcate such a line.

2.10	Computer Vision and Artificial Intelligence.

	A favorite pastime of technology aficionados
consists of defining the term "Artificial Intelligence".
The founders Minsky and McCarthy coined the phrase;
critics such as Lighthill and Dreyfus 
and advocates Nilsson and Feigenbaum.

Futurologists such as Herman Kahn use the term in sentences such as
"True artificial intelligence will not appear until around 2020";
which would seem to leave us, twentieth century people,
with {artificial} artificial intelligence.

	Normal vision, as opposed to visual puzzles, is not an
Artificial  Intelligence  problem  in  the  sense  that it  does  not
involve     cognition,     verbal  abstraction,   symbolism,
theorem proving,   game playing,  planning,  heuristic programming or
self  programming.   In  fact,   I  feel that  computer  vision, list
processing  and symbolic  integration  will  drop out  of  Artificial
Intelligence.

"The history of progress in the development  of systems for automatic
symbolic   integration  poses  an  interesting   question  about  the
definition of artificial intelligence. Few would argue  that Slagle's
SAINT  program was  a  product of  artificial intelligence  research.
Moses'  SIN program for symbolic integration  seldom needed to resort
to search,  and for  this reason some  people consider  it much  more
powerful (intelligent ?) than  SAINT. Now, Risch (1969) has developed
an  algorithm  for  integrating  many  types  of  expressions.  Risch
considers himself  a  mathematician, not  an artificial  intelligence
researcher.  In your opinion  should Risch's  algorithm be considered
part of the subject matter of artificial intelligence ? If  you would
exclude Risch from artificial intelligence, how would you respond to
the  statement  that  every  artificial  intelligence  program  might
eventually  be dominated  by  a  (more intelligent?)  non  artificial
intelligence algorithm?  If you would  include Risch, would  you also
include the long-division algorithm?"

			- Nils J. Nilsson, problem 4-5;
		Problem-Solving Methods in Artificial Intelligence.

	In answer to Nilsson's  problem,  I would exclude  Risch from
Artificial  Intelligence and  cheerfully look  forward to  the remote
day when all A.I. problems are superseded by specific programming
techniques.

		(Feigenbaum Quote).

	The relation between Artificial Intelligence, experiment,
and environmental simulation is indirectly illuminated by
Feigenbaum's observation:

	"The design,  implementation, and use  of the  robot hardware
presents  some   difficult,  and  often  expensive,  engineering  and
maintenance problems. If one is to work in this area, solving such
problems is a necessary prelude but, more often than not,
unrewarding because the activity does not address the questions of
A.I. research that motivate the project. Why, then, build devices?
Why not simulate  them and their environment? In  fact, the SRI group
has done  good work  in simulating  a  version of  their robot  in  a
simplified environment. The  answer given is  as follows. It  is felt
by  the  SRI  group  that  the  most  unsatisfactory  part  of  their
simulation effort was  the simulation of  the environment. Yet,  they
say that  90% of  the effort  of the simulation  team went  into this
part  of  the  simulation. It  turned  out to  be  very  difficult to
reproduce in an internal representation for a  computer the necessary
richness of environment that  would give rise to interesting behavior
by the highly adaptive robot. It is easier and cheaper to build a
hardware robot  to extract what  information it  needs from the  real
world  than to organize  and store a  useful model.  Crudely put, the
SRI group's argument  is that the most  economic and efficient  store
of information about the real world is the real world itself."

					- E. A. Feigenbaum [ref. X].

	Feigenbaum's final statement is correct: the real world is a
good memory of itself; but his conclusion is in error, because it is
necessary to have an environmental simulator in order to read the
world. His opinion that the building of the robot hardware is not
an integral part of A.I. research is very characteristic of
senior Artificial Intelligence theorists and leaves the junior A.I.
experimentalists with the curse of shoddy tools.
2.11	Visual Consciousness.

"For the purpose of  presenting my argument I must  first explain the
basic  premise of sorcery as  don Juan presented  it to me.   He said
that for a sorcerer, the world  of everyday life is not real, or  out
there, as we believe  it is. For a sorcerer, reality  or the world we
all know, is only a description. For the sake of validating this
premise don Juan  concentrated the best  of his efforts into  leading
me to a  genuine conviction that what I held in  mind as the world at
hand was merely a  description of the world;  a description that  had
been pounded into me from the moment I was born."

			- Carlos Castaneda. Journey to Ixtlan.

	The larger context of a vision theory depends on one's
opinion about human consciousness. In my opinion, mind is a program
that is running in the brain.

Now consider what software is needed to account for
consciousness, the private life of the self that burns in our
heads? The so called stream of consciousness consists of little
voice(s) talking, fragments of music playing, and a color visual
display of the present moment. I believe that the major computation
being performed by an intellectual entity in order to stay
conscious of its external world is a reality simulation.

	The basic inspiration for this thesis is a subtle
analogy between 3-D computer graphics and human vision. First
consider computer graphics: it is possible to program a computer to
simulate the view of a camera moving thru a simulated scene.
Architects look at simulated building designs, cartoonists look at
computer simulated commercials, and pilots look at simulated aircraft
carriers. Second, the position of the simulated camera can be
controlled either by direct command or indirectly by a further
simulation, such as of an airplane. In the 3-D display system at the
University of Utah, the position of the simulated camera is kept
coincident with the physical position of the eyes of the viewer.

	Now consider human vision. You are where your eyes are.
The analogy is that the display simulator resembles
the visual display that goes on inside one's head. The subtlety lies in
identifying analogous elements.

{introspection & mimicry arguments: for and against}.
2.12	Summary of Arguments.

	Vision Problems vs. Vision Tasks.
	Discussion of visual tasks.

THREE REQUIREMENTS: (OF A VISION THEORY).

	1. REALITY.
	   Preference for working with real images rather than
	   with puzzle images (i.e. perfect images).
	2. GENERALITY.
	   Preference for the descriptive approach rather than the
	   classification model.
	3. CONTINUITY.
	   Preference for vision in continuous time and space
	   rather than discrete vision.

THREE MODES: (OF A VISION SYSTEM).

	1. REVELATION ≡ DESCRIPTION.
	2. VERIFICATION.
	3. RECOGNITION.

Argument against a "Vision Language";

2.13	Future Vision Work.

	Significant progress in computer vision will have to await
better computer hardware and better computer graphics software,
specifically world modeling software.

	At Stanford University, Lynn Quam and Hans Moravec,


	The machine assembly tasks,
At Stanford Research Institute
because the demand for doing practical vision tasks can  be satisfied
with existing ad hoc methods or by not using a visual sensor at all.

	The potential of a computer entertainment industry...

As William Shakespeare and Carl Hewitt would agree:
all the world's a stage and all the men, women and robots actors.
2.X	Social Consequences.

	Although the political and social consequences of computer
vision are somewhat more remote than those of other computer
applications, the potential for abuse is so great that I feel that it
is necessary to try to develop corresponding ethics along with the
science and technology.

During the period of this research, 1969 to 1973 inclusive,...

An exceedingly good and exact theory of vision could generate....

The potential benefits of understanding vision
outweigh the potential harm that could be wrought.

	As an engineering project, the construction of a killer robot
is safer than research in biological warfare
and definitely not in the same league as the construction of a doomsday
machine or even as the invention of nuclear weapons.